This report explores a dataset containing characteristics and chemical attributes of different white wines. The dataset covers about 4,900 wines and 13 variables. In this exploration, we analyze how the different attributes of a wine influence its quality.
In this section, we take a first look at the wine dataset. We display its dimensions, structure and summary to gain information for further analysis.
Dimension of the dataset:
## [1] 4898 13
Original variables (columns) of the dataset:
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
We process the dataset by dropping the unnecessary “X” column and adding a new ordinal variable for quality. New dimensions and variables of the dataset:
## [1] 4898 13
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "quality.ord"
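The preprocessing step described above can be sketched as follows. This is a minimal illustration on a toy data frame, not the report's own code: the toy values and the use of `factor(..., ordered = TRUE)` are assumptions, chosen so that the result has the seven ordered levels seen in the structure output.

```r
# Toy stand-in for the wine data (assumption: the real file is not shown in the report).
wines <- data.frame(
  X       = 1:6,
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4),
  quality = c(6L, 5L, 6L, 7L, 3L, 9L)
)

# Drop the row-index column "X", which carries no information about the wines.
wines$X <- NULL

# Add an ordered factor for quality; levels 3..9 give seven ordered levels.
wines$quality.ord <- factor(wines$quality, levels = 3:9, ordered = TRUE)

dim(wines)
levels(wines$quality.ord)
```

Dropping one column and adding one leaves the number of columns unchanged, which is why the dimensions stay at 13 columns in the report.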
The structure of the wine dataset is as follows:
## 'data.frame': 4898 obs. of 13 variables:
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## $ quality.ord : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...
Now, we have 12 informative variables of the 4,898 wines in our dataset.
The input variables are:
fixed acidity (tartaric acid) - [g/L]
volatile acidity (acetic acid) - [g/L]
citric acid - [g/L]
residual sugar - [g/L]
chlorides - [g/L]
free sulfur dioxide - [mg/L]
total sulfur dioxide - [mg/L]
density - [g/cm^3]
pH - [1]
sulphates (potassium sulphate) - [g/L]
alcohol - [%]
Each input variable is numerical and describes a physical or chemical quantity, i.e. the amount or proportion of a chemical entity. Additionally, we add another input variable to the dataset: “bound.sulfur.dioxide”, which is derived from total.sulfur.dioxide and free.sulfur.dioxide.
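A minimal sketch of the derived variable, assuming that bound sulfur dioxide is simply the difference between total and free sulfur dioxide. The report does not state the formula, but the summary means are consistent with it (138.4 - 35.31 is about 103.1). The three rows are the first three values from the structure output.

```r
# First three observations of the two sulfur dioxide columns (taken from the str() output).
wines <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30),
  total.sulfur.dioxide = c(170, 132, 97)
)

# Assumption: bound = total - free, both in mg/L.
wines$bound.sulfur.dioxide <-
  wines$total.sulfur.dioxide - wines$free.sulfur.dioxide

wines
```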
The output variable is:
quality
This variable measures the quality of the wine as an integer score from 0 (worst) to 10 (best).
A statistical overview of the raw data is given as follows:
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality quality.ord bound.sulfur.dioxide
## Min. :3.000 3: 20 Min. : 4.0
## 1st Qu.:5.000 4: 163 1st Qu.: 78.0
## Median :6.000 5:1457 Median :100.0
## Mean :5.878 6:2198 Mean :103.1
## 3rd Qu.:6.000 7: 880 3rd Qu.:125.0
## Max. :9.000 8: 175 Max. :331.0
## 9: 5
The ordered categorical output variable is summarized by counting the observations in each category. A quality of 6 is the most common value in the dataset.
In this section, each variable is analyzed separately to see how it is distributed. Are there any anomalies (e.g. unusual characteristics or outliers) within a variable? Should we transform a variable to interpret the data? To answer these questions, histograms (at linear and logarithmic scale) and boxplots of each variable are created. To clean the data, we then define a threshold for outliers and remove observations that do not fit the distribution. Histograms and boxplots are shown again for the cleaned dataset, together with a summary of each cleaned variable.
The histograms and boxplots of the raw data of fixed.acidity are given as follows:
As shown in the boxplot above, we can define all fixed.acidity values above 10.7 g/L as outliers. We create a new dataset wines_new with the same structure as the original wines dataset. In this new dataset, we remove the defined outliers. In further plots, the new dataset is used.
The histograms and boxplots of the clean data of fixed.acidity are given as follows:
This summary calculates the statistical parameters of the new dataset.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.851 7.300 10.300
For the most part, the “fixed.acidity” variable is approximately normally distributed. However, there are some outliers, which have been removed. The median of “fixed.acidity” is 6.8 g/L, and the interquartile range is 1 g/L.
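The threshold-based cleaning can be sketched as below. The data frame is a toy stand-in and `subset()` is one possible way to filter; only the names `wines`, `wines_new` and the 10.7 g/L threshold come from the text. For comparison, the sketch also computes the conventional boxplot upper fence (third quartile plus 1.5 times the interquartile range) from the quartiles reported in the summary.

```r
# Toy values for illustration only (assumption).
wines <- data.frame(fixed.acidity = c(3.8, 6.3, 6.8, 7.3, 10.3, 11.8, 14.2))

# Keep only observations at or below the manually chosen threshold.
threshold <- 10.7
wines_new <- subset(wines, fixed.acidity <= threshold)
summary(wines_new$fixed.acidity)

# Conventional boxplot fence from the reported quartiles (Q1 = 6.3, Q3 = 7.3).
q1 <- 6.3
q3 <- 7.3
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence
```

The manual threshold is looser than the 1.5 x IQR fence, so it removes only the most extreme values and keeps the upper tail that the boxplot already marks as unusual.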
The histograms and boxplots of the raw data of volatile.acidity are given as follows:
As shown in the boxplot above, we can define all volatile.acidity values above 0.75 g/L as outliers. These observations are removed from the new dataset for further analysis.
The histograms and boxplots of the clean data of volatile.acidity are given as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2767 0.3200 0.7400
The “volatile.acidity” is a by-product of the fermentation of the wine. Its distribution is also roughly normal around the median of 0.26 g/L, but compared to “fixed.acidity” it has a longer tail on the right side of the histogram.
The histograms and boxplots of the raw data of citric.acid are given as follows:
As shown in the boxplot above, we can define all citric.acid values above 1.0 g/L as outliers. These observations are removed from the dataset for further analysis.
The histograms and boxplots of the clean data of citric.acid are given as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3338 0.3900 1.0000
For the most part, the “citric.acid” variable is approximately normally distributed around the median of 0.32 g/L. However, there are a few outliers larger than 1.0 g/L, which have been removed. Another anomaly is the peak at 0.5 g/L, which does not fit the otherwise normal shape of the distribution.
The histograms and boxplots of the raw data of residual.sugar are given as follows:
As shown in the boxplot above, we can define all residual.sugar values above 60 g/L as outliers.
The histograms and boxplots of the clean data of residual.sugar are given as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.385 9.900 31.600
The distribution of residual sugar shows an interesting behaviour. The maximum is at the very left side of the plot, followed by a long tail to the right. On a logarithmic scale, two peaks appear in the distribution, distinguishing two groups of wine: wines with a low amount of residual sugar (drier wines) and wines with a higher amount (sweeter wines). Thus, we can regard the distribution of residual.sugar as bimodal. The median of this variable is 5.2 g/L.
In further analysis, we consider four types of sweetness for wines, as described on Wikipedia:
- sweet (residual.sugar > 45.0 g/L)
- medium (12.0 g/L < residual.sugar < 45.0 g/L)
- medium_dry (4.0 g/L < residual.sugar < 12.0 g/L)
- dry (residual.sugar <= 4.0 g/L)
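The effect of the logarithmic scale can be illustrated with simulated data. The sketch below is not based on the wine data: it draws a hypothetical mixture of two log-normal groups and bins it once on the raw scale and once on the log10 scale, where a right-skewed raw histogram can resolve into two separate peaks.

```r
set.seed(123)

# Hypothetical mixture: a low-sugar group and a higher-sugar group (assumption).
residual.sugar <- c(
  rlnorm(500, meanlog = log(1.5), sdlog = 0.3),
  rlnorm(500, meanlog = log(10),  sdlog = 0.4)
)

# Bin the values on the raw and on the log10 scale without drawing.
h_lin <- hist(residual.sugar,        breaks = 30, plot = FALSE)
h_log <- hist(log10(residual.sugar), breaks = 30, plot = FALSE)

# Bin midpoint (in log10 units) of the most populated bin.
h_log$mids[which.max(h_log$counts)]
```

Calling `plot(h_lin)` and `plot(h_log)` draws the two histograms for a visual comparison.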
## dry medium_dry medium sweet
## 2088 1966 825 0
We can see that there are no wines that match with the category “sweet”. Most of the wines are dry or medium_dry.
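One way to build such categories is `cut()` with explicit breaks. The sketch below uses toy values; the breaks follow the thresholds listed above, with the assumption that the boundary values 12.0 g/L and 45.0 g/L (which the listed inequalities leave unassigned) fall into the lower category.

```r
# Toy residual sugar values in g/L (assumption).
residual.sugar <- c(1.6, 4.0, 6.9, 12.0, 20.7, 31.6)

# Intervals are right-closed by default: (-Inf, 4], (4, 12], (12, 45], (45, Inf).
residual.sugar.bucket <- cut(
  residual.sugar,
  breaks = c(-Inf, 4, 12, 45, Inf),
  labels = c("dry", "medium_dry", "medium", "sweet")
)

table(residual.sugar.bucket)
```

Because the factor keeps all four levels, `table()` reports the empty “sweet” category with a count of 0, as in the output above.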
The histograms and boxplots of the raw data of chlorides are given as follows:
As shown in the boxplot above, we can define all chlorides values above 0.1 g/L as outliers.
The histograms and boxplots of the clean data of chlorides are given as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04200 0.04313 0.05000 0.09900
The “chlorides” variable is also roughly normally distributed around the median of 0.042 g/L, but it has a long tail on the right side of the histogram with some outliers, which have been removed.
The histograms and boxplots of the raw data of free.sulfur.dioxide are given as follows:
As shown in the boxplot above, we can define all free.sulfur.dioxide values above 120 mg/L as outliers.
The histograms and boxplots of the clean data of free.sulfur.dioxide are given as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.03 45.00 118.50
The free.sulfur.dioxide variable is also approximately normally distributed. Some outliers create a tail on the right side of the plot. The median is 34 mg/L, or 0.034 g/L.
The histograms and boxplots of the raw data of total.sulfur.dioxide are given as follows:
As shown in the boxplot above, we can define all total.sulfur.dioxide values above 300 mg/L as outliers.
The histograms and boxplots of the clean data of total.sulfur.dioxide are given as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 137.6 167.0 282.0
The concentration of total sulfur dioxide is approximately normally distributed. The median is 134 mg/L, or 0.134 g/L. The interquartile range spans 108 mg/L to 167 mg/L.
The histograms and boxplots of the raw data of bound.sulfur.dioxide are given as follows:
As shown in the boxplot above, we can define all bound.sulfur.dioxide values above 220 mg/L as outliers.
The histograms and boxplots of the clean data of bound.sulfur.dioxide are given as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 77.0 100.0 102.4 124.0 211.0
Like total.sulfur.dioxide, the bound.sulfur.dioxide variable is approximately normally distributed. This is to be expected, because we derived this variable from total.sulfur.dioxide. The median is 100 mg/L, or 0.1 g/L.
The histograms and boxplots of the raw data of density are given as follows:
As shown in the boxplot above, we can define all density values above 1.01 g/cm^3 as outliers.
The histograms and boxplots of the clean data of density are given as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0030
## [1] 8.495685e-06
The density of wine shows a very small variance (8.50e-06) and is, for the most part, normally distributed. However, some outliers can be identified at 1.01 and 1.04 g/cm^3. The median is 0.9937 g/cm^3, which is comparable to the density of water (1 g/cm^3).
The histograms and boxplots of the raw data of the pH values are given as follows:
As shown in the boxplot above, we can define all pH values above 3.8 as outliers.
The histograms and boxplots of the clean data of the pH values are given as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.189 3.280 3.800
The “pH” variable is normally distributed around the median 3.18.
The histograms and boxplots of the raw data of sulphates are given as follows:
As shown in the boxplot above, we can define all sulphates values above 1.0 g/L as outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2300 0.4100 0.4700 0.4894 0.5500 1.0000
The “sulphates” variable is approximately normally distributed around the median of 0.47 g/L, but slightly right-skewed.
The histograms and boxplots of the raw data of alcohol content are given as follows:
As shown in the plot above, all alcohol values are within the range of the boxplot. Thus, no outlier removal is required.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.55 11.40 14.20
## [1] 1.512583
The volume percentage of alcohol within the wine is not normally distributed. It ranges from 8% to 14.2% and has a comparatively wide variance of about 1.513. The median is 10.4%.
In further analysis, we divide the alcohol variable into four groups using the computed quartiles. Thus, we obtain:
## low medium_low medium_high high
## 1105 920 1001 937
The histograms and boxplots of the raw data of the output variable quality are given as follows:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.901 6.000 9.000
The output variable “quality” seems to be approximately normally distributed around a quality of 6, which is also the most common value in the histogram and table above. The observed values only range from 3 to 9. Thus, we have only 7 distinct values, of which 3 and 9 are rare. The median of the distribution coincides with the third quartile (6).
A brief analysis of the distribution of the overall dataset and each variable in detail can be found in the section above.
The “wines” dataset contains 4,898 observations of white wines described by 12 numerical attributes: fixed acidity [g/L], volatile acidity [g/L], citric acid [g/L], residual sugar [g/L], chlorides [g/L], free sulfur dioxide [mg/L], total sulfur dioxide [mg/L], bound sulfur dioxide [mg/L], density [g/cm^3], pH [1], sulphates [g/L], alcohol [%]. Furthermore, we have two categorical variables, residual.sugar.bucket and alcohol.bucket, which group the residual.sugar and alcohol variables into categories.
The output variables are the quality and the ordered factorized variable quality.ord that represent the quality of the white wine.
The main features of interest are the effects of acidity (fixed, volatile, citric acid, pH), sweetness (residual sugar), sulfur concentration (free, bound and total sulfur dioxide, sulphates) and alcohol content on the quality of the wine. The remaining variables (density and chlorides) are grouped into “others”. The aim of the further analysis is to examine these variables group by group and to detect how changes in them relate to changes in the quality of white wines.
We also compared the statistical values of the variables. To get information about the white wines’ ingredients and their concentrations and composition, we used the median within the summary of the cleaned variables.
concentration:
1. fixed.acidity - 6.80 g/L
2. residual.sugar - 5.20 g/L
3. sulphates - 0.47 g/L
4. citric.acid - 0.32 g/L
5. volatile.acidity - 0.26 g/L
6. total.sulfur.dioxide - 0.134 g/L
6.1. bound.sulfur.dioxide - 0.100 g/L
6.2. free.sulfur.dioxide - 0.034 g/L
others:
alcohol - 10.4 %
pH - 3.18
density - 0.9937 g/cm^3
quality - 6
The fixed.acidity has the highest concentration of all chemical substances, followed by the residual.sugar. A change in one of these variables might affect the wine’s quality. Thus, we should focus on the acidity caused by the fixed.acidity and on the sweetness of the wine, given by the amount of sugar. Here, the bimodality of the residual.sugar concentration also has to be explored.
Other variables affecting the quality of wine are the “Level of Dryness” and the bitterness (e.g. the proportion of tannin within the wine). These two can also influence the wine’s quality and should be taken into account in an exploration.
We dropped the X variable, since it does not contain any information about the wines. Furthermore, a quality.ord variable was created, which describes the quality as an ordered categorical variable for further analysis. Additionally, the variables residual.sugar.bucket and alcohol.bucket were created to cut the corresponding variables into categories. Finally, the variable bound.sulfur.dioxide was derived from the total and free sulfur dioxide concentrations.
A new dataset was created, which contains the original dataset without any outliers.
The analysis of the distributions can be found in the section above, within the summary of each variable. For the most part, the variables are approximately normally distributed, with the exception of residual.sugar and alcohol. Anomalies such as outliers and accumulations of certain values are mentioned in the corresponding section underneath each plot. Additionally, outliers were removed.
The bivariate plots section contains a general overview of the relationships between the variables, a detailed description of the correlation of the wine’s quality with other variables, visualized as scatterplots and boxplots, and a subsection in which selected variables are described and plotted with respect to the output variable.
To identify relationships between variables within the dataset, we first compute a correlation matrix and visualize a matrix of plots of the given dataset.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.04 0.30
## volatile.acidity -0.04 1.00 -0.16
## citric.acid 0.30 -0.16 1.00
## residual.sugar 0.07 0.04 0.12
## chlorides 0.09 0.02 0.04
## free.sulfur.dioxide -0.04 -0.11 0.12
## total.sulfur.dioxide 0.09 0.07 0.16
## density 0.26 -0.01 0.17
## pH -0.43 -0.03 -0.15
## sulphates -0.01 -0.06 0.09
## alcohol -0.13 0.08 -0.08
## quality -0.11 -0.18 -0.01
## bound.sulfur.dioxide 0.13 0.13 0.13
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.07 0.09 -0.04
## volatile.acidity 0.04 0.02 -0.11
## citric.acid 0.12 0.04 0.12
## residual.sugar 1.00 0.25 0.35
## chlorides 0.25 1.00 0.13
## free.sulfur.dioxide 0.35 0.13 1.00
## total.sulfur.dioxide 0.43 0.34 0.62
## density 0.84 0.46 0.34
## pH -0.21 -0.04 -0.02
## sulphates -0.02 0.07 0.08
## alcohol -0.48 -0.51 -0.27
## quality -0.10 -0.28 0.03
## bound.sulfur.dioxide 0.36 0.35 0.28
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity 0.09 0.26 -0.43 -0.01 -0.13
## volatile.acidity 0.07 -0.01 -0.03 -0.06 0.08
## citric.acid 0.16 0.17 -0.15 0.09 -0.08
## residual.sugar 0.43 0.84 -0.21 -0.02 -0.48
## chlorides 0.34 0.46 -0.04 0.07 -0.51
## free.sulfur.dioxide 0.62 0.34 -0.02 0.08 -0.27
## total.sulfur.dioxide 1.00 0.56 -0.01 0.14 -0.47
## density 0.56 1.00 -0.11 0.09 -0.81
## pH -0.01 -0.11 1.00 0.16 0.12
## sulphates 0.14 0.09 0.16 1.00 -0.03
## alcohol -0.47 -0.81 0.12 -0.03 1.00
## quality -0.16 -0.31 0.10 0.06 0.43
## bound.sulfur.dioxide 0.93 0.53 0.00 0.13 -0.45
## quality bound.sulfur.dioxide
## fixed.acidity -0.11 0.13
## volatile.acidity -0.18 0.13
## citric.acid -0.01 0.13
## residual.sugar -0.10 0.36
## chlorides -0.28 0.35
## free.sulfur.dioxide 0.03 0.28
## total.sulfur.dioxide -0.16 0.93
## density -0.31 0.53
## pH 0.10 0.00
## sulphates 0.06 0.13
## alcohol 0.43 -0.45
## quality 1.00 -0.21
## bound.sulfur.dioxide -0.21 1.00
The exploration above shows which variables are positively or negatively correlated to each other and variables without any correlation. For our variable of interest - the quality - we can summarize the correlation as follows:
alcohol 0.44
pH 0.09
sulphates 0.06
free.sulfur.dioxide 0.03
citric.acid -0.01
residual.sugar -0.10
fixed.acidity -0.11
total.sulfur.dioxide -0.16
volatile.acidity -0.17
bound.sulfur.dioxide -0.21
chlorides -0.29
density -0.32
The table above shows the correlation coefficients of the variables with respect to quality in decreasing order. We can see that there is no strong correlation. The strongest correlation exists between quality and alcohol (r = 0.44). The absolute correlation coefficients of pH, sulphates, free.sulfur.dioxide and citric.acid are smaller than 0.1. Thus, we can assume that these variables have little or no linear correlation with quality.
The strongest negative correlation can be identified between quality and density (-0.32), followed by the concentration of chlorides (-0.29) and bound.sulfur.dioxide (-0.21).
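A ranked list of this kind can be obtained by taking the quality column of the correlation matrix and sorting it. The sketch below uses simulated data, so the variable relationships and coefficients are assumptions and will not reproduce the numbers in the report. It only illustrates the mechanics.

```r
set.seed(1)
n <- 200

# Simulated stand-ins (assumption): density falls and quality rises with alcohol.
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.0 - 0.002 * alcohol + rnorm(n, sd = 0.001)
quality <- round(pmin(pmax(3 + 0.3 * alcohol + rnorm(n, sd = 0.8), 3), 9))

wines <- data.frame(alcohol, density, quality)

# Pearson correlation of every variable with quality, excluding quality itself.
r <- cor(wines)[, "quality"]
r <- r[names(r) != "quality"]

# Sort in decreasing order, rounded to two decimals as in the report.
r_sorted <- sort(round(r, 2), decreasing = TRUE)
r_sorted
```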
In this section, we want to explore the correlation of quality with other variables. Therefore, we consider the different groups mentioned in the section above. Scatterplots are used to visualize the bivariate relationships.
Within the scatterplots, we jittered the points to clarify the relationship between the variables, especially with regard to the discrete x-axis of quality. Furthermore, we added a fitted linear model to each diagram to show the trend of the relationship.
Within the acidity group, the volatile.acidity variable has the strongest correlation with quality. In comparison, pH and especially citric.acid seem to have little or no influence on the quality of the wine. The fixed.acidity is only slightly correlated with quality but has the highest concentration, as shown in the univariate section.
The residual.sugar also has a high concentration but only a weak correlation with quality. Furthermore, we can detect the bimodality of the distribution of residual.sugar by looking at the density of points within the logarithmic plot. Thus, we have to perform a more detailed exploration of this variable.
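The combination of jitter and a fitted line can be sketched in base R as follows. The data are simulated and the report's actual plotting code is not shown, so the calls below are an assumption about one possible implementation. The plot is sent to a null device here, and removing the `pdf(NULL)` and `dev.off()` lines draws it on screen.

```r
set.seed(42)
n <- 300

# Simulated data (assumption): discrete quality scores and a noisy linear trend.
quality <- sample(3:9, n, replace = TRUE, prob = c(1, 3, 30, 45, 18, 3, 1))
alcohol <- 8 + 0.4 * quality + rnorm(n)

# Linear model of the plotted variable on quality.
fit <- lm(alcohol ~ quality)

pdf(NULL)  # null graphics device: nothing is written to disk
plot(jitter(quality, amount = 0.3), alcohol,
     pch = 16, col = rgb(0, 0, 0, 0.3),
     xlab = "quality (jittered)", ylab = "alcohol [%]")
abline(fit, col = "red", lwd = 2)  # regression line on the unjittered fit
invisible(dev.off())

coef(fit)
```

The jitter is applied only for display. The model is fitted to the original, unjittered quality scores.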
To see how the quality is distributed for each category of sweetness, we plot several histograms:
The strongest correlation within the sulfur group exists between bound sulfur dioxide and quality. Total sulfur dioxide has a slightly weaker correlation. Since both variables depend on each other, we may look only at the bound sulfur dioxide variable to explore the impact on quality.
The alcohol content clearly has the strongest correlation with quality, which is also visualized by the regression line in the scatterplot. In further analysis, we also have to relate the alcohol content to other variables to find more relationships within the dataset.
The chlorides and density variables both have a relatively strong negative correlation with quality compared to the other variables. Thus, we have to take both variables into account in further analysis.
In this section, we want to show the relationship of quality with other variables by using boxplots. Again, we consider the same groups as in the scatterplot exploration before. Instead of quality, we use quality.ord to group the dataset by quality and to create boxplots in which we can distinguish between the different levels of quality. Additionally, two boxplots are depicted for each variable: one at linear scale and one at logarithmic scale.
The visualization with boxplots allows us to explore the median of a variable for a certain quality. The medians of weakly correlated variables are almost equal for every level of quality. This can be seen in the boxplots for citric.acid. Here, we also have a lot of data outside the interquartile range.
As described above, the residual.sugar is bimodal, resulting in large interquartile ranges in the logarithmic plot. The median of the variable also alternates between adjacent levels of quality. Thus, the residual.sugar would be more informative if we put its values into different bins or buckets to create separate groups of wines.
The sulphates and free.sulfur.dioxide variables are weakly correlated with quality, which can also be seen in the almost constant median across the levels of quality. However, the total and bound.sulfur.dioxide have their maximum concentration at a quality of 5. After that, the concentration decreases for higher levels of quality, resulting in a negative correlation, despite the fact that the medians for a quality of 3 and 4 are lower than the median at 5.
Within the boxplots of the variable with the strongest correlation, the alcohol content, the highest level of quality also has the highest percentage of alcohol. Since only a small number of wines has this high quality (5 white wines), we have a small interquartile range, but one outlier at a lower alcohol concentration. The lowest median alcohol percentage is found for wines with a quality of 5. Thus, there are more factors behind a high quality in white wine than alcohol content alone.
Density and the chlorides concentration are both moderately negatively correlated with quality, which can also be seen in the boxplots. Here, the median values of both variables decrease at higher levels of quality.
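The medians that the boxplots display can also be computed directly per quality level. The following sketch uses a toy data frame and `tapply()`. Both are assumptions for illustration and not the report's own code.

```r
# Toy data (assumption): three wines for each of three quality levels.
wines <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 9.8, 10.0, 10.4, 10.9, 11.2, 11.4, 12.0)
)
wines$quality.ord <- factor(wines$quality, levels = 3:9, ordered = TRUE)

# Median alcohol per quality level; droplevels() skips levels without observations.
med <- tapply(wines$alcohol, droplevels(wines$quality.ord), median)
med
```

A call such as `boxplot(alcohol ~ quality.ord, data = wines)` draws the corresponding boxplots, with the computed medians as the central lines.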
In this section, we complete the bivariate exploration and analysis of the dataset. In addition to the examination above, we present a summary and further observations.
The correlation and plot matrices show that quality is not strongly correlated with any single variable in the dataset. The highest correlation coefficients (by absolute value) are:
alcohol 0.44
bound.sulfur.dioxide -0.21
chlorides -0.29
density -0.32
We want to take these variables into account for further exploration. Additionally, we want to take a look at the residual.sugar because it shows an interesting behaviour. If we consider the variable with all of its values, we only obtain a correlation coefficient of -0.10 with quality. To get a better understanding of the residual.sugar and its relationship to quality, we explore the scatter plot and correlation for each category of sweetness.
##
## Pearson's product-moment correlation
##
## data: residual.sugar and quality
## t = 7.7601, df = 1707, p-value = 1.451e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1383904 0.2299996
## sample estimates:
## cor
## 0.1845959
##
## Pearson's product-moment correlation
##
## data: residual.sugar and quality
## t = -6.6999, df = 1560, p-value = 2.903e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2150574 -0.1186283
## sample estimates:
## cor
## -0.1672428
##
## Pearson's product-moment correlation
##
## data: residual.sugar and quality
## t = -3.6393, df = 690, p-value = 0.0002938
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.20962021 -0.06335175
## sample estimates:
## cor
## -0.137234
For dry wines, we obtain a positive correlation coefficient of 0.185. However, for wines with higher sweetness, i.e. the medium_dry and medium categories, quality tends to decrease with increasing residual sugar (r = -0.167 and r = -0.137).
The bound.sulfur.dioxide variable, which we derived before, shows a negative correlation with quality (-0.21), i.e. higher bound.sulfur.dioxide concentrations are associated with lower quality. It is also interesting that this variable has a higher absolute correlation coefficient with quality than the total.sulfur.dioxide.
We also obtained an inverse relationship between the concentration of chlorides and the quality of white wines (r = -0.29). This might partly result from the correlation of chlorides with alcohol, which is also negative, and from the correlation of chlorides with the sulfur dioxide concentration, which is positive.
The most negative correlation is observed when analyzing the relationship between density and quality (r = -0.32). Since density is positively correlated with other variables that have an inverse relationship with quality (e.g. chlorides), this negative correlation coefficient might be driven by those other variables.
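Per-category correlation tests like the ones above can be produced by splitting the data by sweetness bucket and calling `cor.test()` on each part. The sketch below runs on simulated, uncorrelated data, so its estimates are meaningless and will differ from the report. The variable names follow the text, and everything else is an assumption.

```r
set.seed(7)
n <- 300

# Simulated data (assumption): sugar and quality are drawn independently.
residual.sugar <- runif(n, min = 0.6, max = 30)
quality <- sample(3:9, n, replace = TRUE)
residual.sugar.bucket <- cut(
  residual.sugar,
  breaks = c(-Inf, 4, 12, 45, Inf),
  labels = c("dry", "medium_dry", "medium", "sweet")
)
wines <- data.frame(residual.sugar, quality, residual.sugar.bucket)

# One Pearson test per non-empty sweetness category.
groups <- split(wines, droplevels(wines$residual.sugar.bucket))
tests <- lapply(groups, function(d) cor.test(d$residual.sugar, d$quality))

# Collect the estimated correlation coefficient of each category.
r_by_bucket <- sapply(tests, function(ct) unname(ct$estimate))
r_by_bucket
```

Splitting first and testing afterwards keeps the three tests independent of each other, which matches the three separate outputs shown above.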
Besides the main feature of interest, we also observed relationships between other variables, as shown above. An example is the relationship of the (overall) density with the concentrations of other variables, which results from the specific density of each chemical substance and its amount within the wine.
The strongest relationship with quality is obtained for the alcohol content of white wines. Therefore, we want to focus further explorations on this combination. Additionally, we cut the alcohol variable into four groups to group several wines together.
In an overall correlation test, we obtained the highest correlation coefficient (r = 0.93) between bound.sulfur.dioxide and total.sulfur.dioxide, which is to be expected, since the first variable is derived from the second.
Another strong relationship exists between residual.sugar and density (r = 0.84). This is plausible, since sugar is denser than water. Thus, adding sugar will increase the overall density of wine. Conversely, an increasing alcohol content will decrease the density, since alcohol is less dense than water (r = -0.81).
In this section, we want to explore multivariate relationships. We try to identify interesting patterns by plotting and analyzing at least three variables of the white wines at a time.
At first, we reproduce the scatter plots of the identified features above and add the quality as a third variable.
Alcohol and chlorides are negatively correlated, as shown in the bivariate plots section. This means that wines with low alcohol content tend to have a higher chlorides concentration. Adding the factorized quality reveals a slight pattern. We can see that most of the wines with a high alcohol content and a low chloride concentration have a good quality. On the other hand, low-quality wines can be found at low alcohol content and higher chloride concentration.
An interesting pattern can be obtained by adding the residual.sugar.bucket variable to the plot. Most of the wines with high alcohol content are dry, whereas medium wines (i.e. wines with more residual sugar) have less alcohol. When we compare this plot with the alcohol-chlorides-quality plot above, we can see that dry wines coincide with regions of high quality within the plot, while medium_dry and medium wines coincide with low quality. This is consistent with the barplots of quality and residual.sugar faceted by residual.sugar.bucket.
The plots above show the relationship of alcohol and bound.sulfur.dioxide as well as the influence of sugar and the impact on quality. As in the alcohol vs. chlorides plot, we have a negative correlation between alcohol and the bound sulfur dioxide concentration, but in these plots we can observe that dry and medium_dry wines can be found nearly all over the diagram. In the quality plot, we can see again that high alcohol content and low bound sulfur dioxide concentration go along with better quality.
In the plot above, we displayed the relationship between alcohol, residual sugar and quality. To get a better view of the sugar variable, we used a facet_wrap to group the wines into their category of sweetness. Again, we can observe that high alcohol content goes along with better quality. However, there are only a few wines with an alcohol content higher than 10% in the medium category. Thus, most of the wines with high alcohol content and good quality can be found in the dry and lower medium_dry division.
The visualization above shows a scatterplot of alcohol content and overall density of the wines, grouped by sweetness. The color represents the quality of the wine. Additionally, we plotted a regression line showing the relationship between alcohol and density. The two variables are negatively correlated, as shown by the falling regression line. The correlation coefficients for “dry”, “medium_dry” and “medium” are:
##
## Pearson's product-moment correlation
##
## data: alcohol and density
## t = -64.322, df = 1707, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8547003 -0.8269575
## sample estimates:
## cor
## -0.8413823
##
## Pearson's product-moment correlation
##
## data: alcohol and density
## t = -63.813, df = 1560, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8634866 -0.8359618
## sample estimates:
## cor
## -0.8503046
##
## Pearson's product-moment correlation
##
## data: alcohol and density
## t = -27.063, df = 690, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7518865 -0.6793703
## sample estimates:
## cor
## -0.7175675
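Output like the three test results above can be obtained by running `cor.test` once per sweetness group. The following is a minimal sketch on made-up data: the `toy` data frame, its coefficients and the `sweetness` column name are assumptions, so the numbers it prints are unrelated to the results reported here.

```r
# Toy data for illustration only; not the wine dataset.
set.seed(1)
toy <- data.frame(
  sweetness = rep(c("dry", "medium_dry", "medium"), each = 50),
  alcohol   = runif(150, 8, 14)
)
# Density falls with alcohol, plus a little noise (made-up coefficients).
toy$density <- 1.01 - 0.002 * toy$alcohol + rnorm(150, sd = 0.001)

# Run one Pearson test per sweetness group.
tests <- lapply(split(toy, toy$sweetness), function(g) {
  with(g, cor.test(alcohol, density, method = "pearson"))
})

# Collect the estimates and 95% confidence limits in one small table.
result <- t(sapply(tests, function(res) {
  c(cor = unname(res$estimate),
    lower = res$conf.int[1],
    upper = res$conf.int[2])
}))
print(round(result, 3))
```

Putting the estimates next to their confidence intervals makes it easier to judge whether the coefficients of two groups really differ.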
We can see that the correlation coefficients differ between the groups. A reason for this behaviour might be that density is given as mass per volume (g/cm^3). Since alcohol has a lower density than water and sugar a much higher density, the overall density is affected differently in each group of sweetness. First, we can see that the plots are shifted to the right because of an increase of residual sugar and, hence, higher density. More sugar per volume also results in less alcohol per volume. The correlation of quality with these variables is shown in the sections before. Here, we can also observe the better quality for higher alcohol content in each group of sweetness. Furthermore, the quality decreases as the density increases.
The plots above show the relationship between density and chlorides for different groups of sweetness (top) and alcohol content (bottom), colored by quality. The sweetness grouping shows strong positive correlations between density and chlorides, which is not the case within the alcohol content group plots. There, only a slight increase can be recognized. Both visualizations show that the density increases with higher sweetness and decreases with a higher level of alcohol, as described before. Furthermore, the quality is given by the color of the points. In the sweetness grouping, we observe that the best quality exists for low density and low chlorides in each group of sweetness. In the alcohol group plot, the quality changes from group to group as expected, since there is a strong positive correlation between alcohol and quality. Thus, the points are getting darker from plot to plot as the alcohol content increases.
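The density argument can be illustrated with a crude mixing model. The sketch below is an assumption-laden simplification, not part of the original analysis: it treats wine as an ideal mixture of water, ethanol and sugar, ignores all other components and the non-ideal mixing of ethanol and water, and uses approximate textbook densities at about 20 °C. The helper `mix_density` is hypothetical.

```r
# Hypothetical helper: ideal-mixture density in g/cm^3.
# alcohol_pct is percent by volume, sugar_g_per_l is grams per litre.
# Default densities are approximate values (assumptions of this sketch).
mix_density <- function(alcohol_pct, sugar_g_per_l,
                        rho_water = 0.998,
                        rho_ethanol = 0.789,
                        rho_sugar = 1.59) {
  v_alc   <- alcohol_pct / 100                    # volume fraction of ethanol
  v_sugar <- sugar_g_per_l / (1000 * rho_sugar)   # volume fraction of sugar
  rho_water * (1 - v_alc - v_sugar) +             # mass of water per cm^3
    rho_ethanol * v_alc +                         # mass of ethanol per cm^3
    sugar_g_per_l / 1000                          # mass of sugar per cm^3
}

# Evaluate the model on a small grid of alcohol and sugar levels.
grid <- expand.grid(alcohol_pct = c(9, 11, 13), sugar_g_per_l = c(2, 10, 20))
grid$density <- mix_density(grid$alcohol_pct, grid$sugar_g_per_l)
print(grid)
```

Only the direction of the effects is meaningful here: the modelled density falls with more alcohol and rises with more sugar. The absolute values should not be compared with the measured densities.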
Another visualization of this relationship can be found here:
The plot at the top shows the relationship of density and chlorides colored by alcohol. The bottom plot is colored by quality. We can see that the regions of wines with high alcohol also contain wines with high quality.
The relationship between density and bound sulfur dioxide is similar to the relationship between density and chlorides. Both pairs are strongly positively correlated and show a strong correlation within the three sweetness groups. However, the correlation within the different alcohol content divisions is slightly stronger than the correlation between density and chlorides, but still quite weak. The levels of quality shown in the plots above are high for low density and low concentration of bound sulfur dioxide. This can be observed especially in the different groups of sweetness. The effect of the different groups of alcohol is comparable with the density-chlorides plots. Wines with high quality can be found in the high alcohol content group, whereas the low alcohol content group only contains wines of lower quality.
Another visualization of density and bound sulfur dioxide shows that there are certain layers of density with low, medium or high alcohol content which spread over a wide range of bound sulfur dioxide concentrations. However, only a few wines with a high concentration and high density have a medium alcohol content. As expected, high quality wines can be found at low density and low to medium bound sulfur dioxide concentration.
Since the concentrations of chlorides and bound sulfur dioxide seem to have a similar impact on the alcohol content and the quality of wines, we want to explore their relationship with these variables. In the two diagrams above, we can see that an increase of chlorides has a larger impact on the quality than an increase of sulfur dioxide. Although most of the wines with high alcohol content can be found at low chloride concentration, there are a few which can be found at a higher concentration. In relation to that, wines with high alcohol content at low chloride concentration are spread over nearly the whole range of bound sulfur dioxide.
In the end, we want to take a look at the density vs. residual sugar plots:
Finally, we want to take residual sugar into account as a continuous variable in the exploration of density, alcohol and quality. The last four plots strengthen the results obtained before. As described in the sections above, we have a strong positive correlation between the overall density and residual sugar due to the high density of sugar in comparison to water or alcohol. We can see in both visualizations that the number of wines with high alcohol content decreases as the concentration of residual sugar increases. Thus, there aren’t any wines with high alcohol in regions with high residual sugar and high overall density. Since quality is strongly positively correlated with alcohol, we also see that high quality wines are in the same regions as wines with high alcohol content.
This observation summarizes and strengthens the results we obtained in the bivariate plots and analysis section. The percentage of alcohol still has a major impact on the wine’s quality, which is reflected in the high correlation coefficient. Thus, the wines with the highest quality can only be found in regions with high alcohol content, despite the impact of other variables. This is also plausible because the alcohol content is the only variable with a strong positive correlation with quality. A variable which combines the influence of several other variables is the wine’s overall density. Due to the different densities of the ingredients of wine, the overall density is the result of their combination. Alcohol has a positive impact on quality and a low density, whereas ingredients like sulfur dioxide or sugar negatively influence the quality and have a higher density than water. As a result, the overall density is negatively correlated with quality. We can observe this behaviour in the plots in the section above. Therefore, coloring the points by the level of quality and the content of alcohol supports the understanding of the relationships.
Most of the interesting interactions are given by the content of alcohol and the overall density as described before. A surprising relationship within the dataset exists between the concentrations of bound.sulfur.dioxide and chlorides. Both variables are negatively correlated with alcohol and quality with nearly the same correlation coefficients. In the bound.sulfur.dioxide vs. chlorides plots, we can see that the increase of chlorides has a larger impact on the quality than the increase of sulfur dioxide. Although most of the wines with high alcohol content can be found at low chlorides concentration, there are a few which can be found at a higher concentration. In relation to that, wines with high alcohol content at low chloride concentration are spread nearly over the whole range of bound sulfur dioxide.
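A comparison of these correlation coefficients could be set up as sketched below. Everything in the sketch is illustrative: the `toy` data and its coefficients are made up, and defining bound.sulfur.dioxide as total minus free sulfur dioxide is an assumption about how the variable was derived.

```r
# Toy data for illustration only; not the wine dataset.
set.seed(7)
n <- 200
toy <- data.frame(alcohol = runif(n, 8, 14))
toy$chlorides            <- 0.08 - 0.003 * toy$alcohol + rnorm(n, sd = 0.008)
toy$free.sulfur.dioxide  <- runif(n, 10, 60)
toy$total.sulfur.dioxide <- toy$free.sulfur.dioxide + 150 -
  6 * toy$alcohol + rnorm(n, sd = 15)
toy$quality <- round(pmin(9, pmax(3, 1 + 0.45 * toy$alcohol + rnorm(n, sd = 0.8))))

# Assumption: bound sulfur dioxide is the part of the total that is not free.
toy$bound.sulfur.dioxide <- toy$total.sulfur.dioxide - toy$free.sulfur.dioxide

# Rows: the two concentrations; columns: the variables they are compared with.
cors <- cor(toy[, c("chlorides", "bound.sulfur.dioxide")],
            toy[, c("alcohol", "quality")])
print(round(cors, 2))
```

Reading the resulting 2 x 2 matrix row by row shows directly whether the two concentrations relate to alcohol and quality in a similar way.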
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.901 6.000 9.000
The first major plot visualizes the distribution of the wines’ quality within the dataset. The plot is based on the cleaned data, i.e. wines with outliers in any of the other variables have been removed. We can see that most of the wines have a quality of 6, which is also the median of the distribution. The mean is slightly lower (5.9) because the second most common quality of white wine is 5. Altogether, the quality of white wines looks approximately normally distributed. Although the quality scale ranges from 0 to 10, all wines can be found between 3 and 9. Thus, there aren’t any wines with the highest quality or with a quality lower than 3. Because of this small number of quality levels, it is hard to determine the real distribution of this variable.
The second major plot shows the relationship between alcohol content and density for different levels of sweetness. Additionally, the color of the points represents the quality of the wine. In each category (dry, medium dry, medium), we have a strong negative correlation between alcohol and density. The correlation becomes weaker for medium wines. We can also see that the percentage of alcohol and the overall density are restricted by the level of sugar. Dry wines cover the full range of alcohol percentages but only reach densities below 0.995 g/cm^3. These limits are shifted in the other two categories. For the medium level of sweetness, there are only a few wines with an alcohol content above 12%. In addition, none of them has a density lower than 0.9925 g/cm^3, but densities above 1 g/cm^3 can be reached. This result is plausible since sugar has a very high density compared to water and alcohol. Therefore, the ratio of sugar to alcohol significantly affects the density of wine. This ratio also influences the quality of wine. Wines with a high quality usually have a higher percentage of alcohol. Thus, there is a strong positive correlation between them (more details are shown in Plot Three). In the plotted categories, the color representing the quality gets lighter from top left to bottom right in each diagram. This means that the quality decreases with a decrease of the percentage of alcohol and an increase of density. However, we can detect some high quality wines in the medium division at low alcohol content and high density, which might be a result of the complex distribution of the residual.sugar variable within the dataset.
The third plot shows the relationship between the wines’ alcohol content and quality, visualized as a box plot. As described, alcohol percentage is the variable most strongly related to quality. Therefore, we want to take a detailed look at this relationship. The plot gives us information about the alcohol content distribution for each level of quality. We added further information such as the mean of alcohol for each quality (blue dot) as well as the overall mean (red line) and median (green line) of alcohol in the dataset. First, we can see that, with the exception of quality levels 3 and 4, the mean of alcohol increases with quality. This range contains most of the wines of the dataset. Thus, there is a positive correlation, which is apparently little affected by the decrease of the mean of alcohol on the left side of the boxplot. The drop of the average alcohol content at quality = 5 might be a result of a high density due to the concentration of residual sugar. The medians of both overall density and residual sugar are at their maximum at quality 5, as the corresponding boxplots in the bivariate plots section show. Additionally, the concentration of bound sulfur dioxide is maximal at a quality level of 5, too. Since alcohol is given as a percentage of volume, this could lead to an alcohol content far below the overall mean and median. The range of the alcohol content also differs between qualities. This is because the quality is roughly normally distributed, so each level of quality contains a different number of wines. As shown in Plot One, there are only a few wines with a quality of 9. All of them have an alcohol content between 12% and 13%. Thus, this characteristic is quite significant for wines of a high quality.
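A box plot with these additions could be built along the following lines. This is a sketch under assumptions rather than the code behind Plot Three: the `toy` data are made up, and the choice of `stat_summary` and `geom_hline` for the mean dots and reference lines is only one way to draw them. The `fun` argument requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

# Toy data for illustration only; group sizes and values are made up.
set.seed(3)
toy <- data.frame(quality = rep(3:9, times = c(5, 20, 90, 120, 50, 12, 3)))
toy$alcohol <- 8 + 0.5 * (toy$quality - 3) + rnorm(nrow(toy), sd = 0.8)

# Mean alcohol content per quality level (the blue dots in the plot).
group_means <- tapply(toy$alcohol, toy$quality, mean)

p <- ggplot(toy, aes(x = factor(quality), y = alcohol)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", colour = "blue", size = 3) +
  geom_hline(yintercept = mean(toy$alcohol), colour = "red") +      # overall mean
  geom_hline(yintercept = median(toy$alcohol), colour = "green") +  # overall median
  labs(x = "Quality", y = "Alcohol (% by volume)")

print(round(group_means, 2))
```

Printing `group_means` next to the plot helps to check the per-level means numerically, which is useful for sparsely populated levels where the box alone can be misleading.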
In this section, we want to summarize the exploration and analysis of the white wine dataset. To begin the analysis, we plotted histograms of each variable at linear and logarithmic scale to look at the individual distributions and to identify characteristics of the variables. We also created boxplots to detect outliers and to get another view of the distributions. After we analyzed the histograms and boxplots, we defined a threshold for outliers and removed them in the next step. The cleaned variables were saved in a new dataset “wines_new”, which was used in the further exploration and analysis. We also plotted the histograms and boxplots of the clean data again to get a better view of the variables.
Within the bivariate analysis, we created a scatterplot and a correlation matrix for all of the continuous variables to get an overview of the relationships. Since we wanted to get more information about our output variable, the wines’ quality, we created scatterplots at linear and logarithmic scale of all variables against quality. Additionally, boxplots were displayed to show the change of the distribution of the variables at different levels of quality. This exploration helped us to identify the main features in our dataset, which had to be analyzed in further steps. In our case, we took a closer look at alcohol, density, bound.sulfur.dioxide, chlorides, residual.sugar and quality. The first four variables have a relatively high absolute correlation coefficient with quality. The residual.sugar variable showed an interesting bimodal behaviour, so we cut the variable into several groups of sweetness. This new variable helped us to classify wines and to get a better understanding of the relationships, as shown in Plot Two. We also cut the alcohol variable into four different groups of alcohol content to support the visualization of other variables in further plots.
These bucket variables also supported the multivariate analysis, since they made colored or faceted visualizations possible. These plots strengthened the results of the first two sections and helped to understand the relationships within the dataset. Thus, we could show the influence of the alcohol content on the quality, which are positively correlated.
The work with this dataset was very interesting. I learned a lot about the programming language R and how to visualize variables and data. But I think there are datasets that might be a little bit more suitable for beginners, since there were no obvious correlations. Thus, I spent too much time experimenting with the variables to get a clear correlation or trend. However, the topic itself is very interesting and I would like to have more information about white wines to analyze more features. I also think it would be interesting to analyze the prices and to compare cheap and expensive wines.
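Bucket variables of this kind can be created with `cut`. The sketch below is illustrative only: the `toy` values, the sugar breakpoints (which loosely follow common labelling limits of 4, 12 and 45 g/L), the use of quartiles for the four alcohol groups, and the column and label names are all assumptions, since the thresholds actually used are not stated here.

```r
# Toy data for illustration only; not the wine dataset.
toy <- data.frame(
  residual.sugar = c(1.2, 3.9, 4.5, 8.0, 11.9, 12.5, 18.0, 30.0),
  alcohol        = c(8.5, 9.4, 10.2, 10.9, 11.5, 12.3, 13.0, 13.8)
)

# Assumed breakpoints in g/L; the report's own thresholds may differ.
sugar_breaks <- c(0, 4, 12, 45)
toy$sweetness <- cut(toy$residual.sugar,
                     breaks = sugar_breaks,
                     labels = c("dry", "medium_dry", "medium"),
                     ordered_result = TRUE)

# Four alcohol groups split at the quartiles (also an assumption).
toy$alcohol.group <- cut(toy$alcohol,
                         breaks = quantile(toy$alcohol, probs = seq(0, 1, 0.25)),
                         labels = c("low", "medium_low", "medium_high", "high"),
                         include.lowest = TRUE,
                         ordered_result = TRUE)

print(table(toy$sweetness))
print(table(toy$alcohol.group))
```

Because `cut` uses right-closed intervals by default, a wine with exactly 4 g/L would fall into the first group here. `include.lowest = TRUE` keeps the minimum alcohol value from turning into `NA`.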